LG – Machine Learning  CV – Computer Vision  CL – Computation and Language



1. [CL] UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining
2. [LG] Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning
3. [LG] Masked Siamese ConvNets: Towards an Effective Masking Strategy for General-purpose Siamese Networks
4. [LG] VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
5. [CV] 3D-aware Conditional Image Synthesis
[LG] A Survey of Geometric Optimization for Deep Learning: From Euclidean Space to Riemannian Manifold
[LG] A Nonstochastic Control Approach to Optimization
[CV] Denoising Diffusion Probabilistic Models for Robust Image Super-Resolution in the Wild
[CV] MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation

Summary: fairer and more effective language sampling for large-scale multilingual pretraining; understanding ensemble, knowledge distillation, and self-distillation in deep learning; a vision-language transformer with weakly-supervised local-feature alignment; 3D-aware conditional image synthesis; a survey of geometric optimization for deep learning; a nonstochastic control approach to optimization; denoising diffusion probabilistic models for robust image super-resolution in the wild; controlled image generation by fusing diffusion paths

1. [CL] UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining

H W Chung, X Garcia, A Roberts, Y Tay, O Firat, S Narang, N Constant
[Google]

Key points:

  1. UniMax outperforms temperature-based sampling for multilingual pretraining across model scales;
  2. UniMax covers high-resource (head) languages more uniformly while mitigating overfitting on low-resource (tail) languages;
  3. An improved and refreshed variant of the mC4 multilingual corpus is released, covering 29 trillion characters across 107 languages.

One-sentence summary:
UniMax is a new language-sampling method for large-scale multilingual pretraining that covers high-resource languages more uniformly while mitigating overfitting on low-resource languages.

Pretrained multilingual large language models have typically used heuristic temperature-based sampling to balance between different languages. However, previous work has not systematically evaluated the efficacy of different pretraining language distributions across model scales. In this paper, we propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages by explicitly capping the number of repeats over each language's corpus. We perform an extensive series of ablations testing a range of sampling strategies on a suite of multilingual benchmarks, while varying model scale. We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases. As part of our contribution, we release an improved and refreshed variant of the mC4 multilingual corpus consisting of 29 trillion characters across 107 languages. In addition, we release full code to reproduce our experiments.
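
The repeat-capping described in the abstract can be sketched as a capped-uniform budget allocation (a minimal sketch under assumed inputs; function and variable names are illustrative, not the released code):

```python
def unimax_allocation(corpus_chars, budget, max_epochs):
    """Capped-uniform split of a character budget across languages:
    smallest corpora first, each language capped at max_epochs passes
    over its own data; unused budget flows to larger languages."""
    langs = sorted(corpus_chars, key=corpus_chars.get)  # smallest first
    alloc, remaining = {}, float(budget)
    for i, lang in enumerate(langs):
        share = remaining / (len(langs) - i)            # uniform split of what's left
        alloc[lang] = min(share, max_epochs * corpus_chars[lang])
        remaining -= alloc[lang]
    return alloc
```

Visiting languages smallest-first means any budget a tail language cannot absorb (because of the epoch cap) is redistributed uniformly among the remaining, larger languages.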

https://openreview.net/forum?id=kXwdL1cWOAi

2. [LG] Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning

Z Allen-Zhu, Y Li
[Meta AI & CMU]

Key points:

  1. Ensembles of independently trained neural networks can improve deep learning test accuracy for certain data structures;
  2. The ensemble's superior performance can be distilled into a single model;
  3. Self-distillation implicitly performs "ensemble + knowledge distillation";
  4. Knowledge distillation does not work for random feature mappings in deep learning.

One-sentence summary:
Ensemble and knowledge distillation in deep learning work differently from traditional learning theory: special structure in the data is what lets ensembles of neural networks improve test accuracy.

We formally study how ensemble of deep learning models can improve test accuracy, and how the superior performance of ensemble can be distilled into a single model using knowledge distillation. We consider the challenging case where the ensemble is simply an average of the outputs of a few independently trained neural networks with the same architecture, trained using the same algorithm on the same data set, and they only differ by the random seeds used in the initialization. We show that ensemble/knowledge distillation in deep learning works very differently from traditional learning theory (such as boosting or NTKs). We develop a theory showing that when data has a structure we refer to as “multi-view”, then ensemble of independently trained neural networks can provably improve test accuracy, and such superior test accuracy can also be provably distilled into a single model. Our result sheds light on how ensemble works in deep learning in a way that is completely different from traditional theorems, and how the “dark knowledge” is hidden in the outputs of the ensemble and can be used in distillation.
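
As a toy illustration of the setting studied here — an averaging ensemble whose softened outputs become the student's targets — this is standard knowledge distillation, not the paper's theory:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_targets(teacher_logits, T=2.0):
    """Average the logits of independently trained teachers (the ensemble),
    then soften with temperature T to expose the 'dark knowledge'."""
    ensemble = np.mean(teacher_logits, axis=0)
    return softmax(ensemble, T)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between softened ensemble targets and student outputs."""
    p = distill_targets(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(-(p * np.log(q + 1e-12)).sum(axis=-1).mean())
```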

https://openreview.net/forum?id=Uuf2q9TfXGA

3. [LG] Masked Siamese ConvNets: Towards an Effective Masking Strategy for General-purpose Siamese Networks

L Jing, J Zhu, Y LeCun
[OpenAI & New York University]

Key points:

  1. Siamese networks use self-supervised learning to learn useful representations without human supervision;
  2. Existing methods rely heavily on hand-crafted augmentations that are not easily adapted to new domains;
  3. A general-purpose masking strategy is proposed for siamese networks with arbitrary backbones, including ConvNets, improving performance on image classification and object detection;
  4. The proposed Masked Siamese ConvNets (MSCN) outperform prior methods on object detection benchmarks.

One-sentence summary:
Siamese networks benefit from a general-purpose masking strategy that works with ConvNets, improving performance on few-shot image classification and object detection.

Siamese Networks are a popular self-supervised learning framework that learns useful representation without human supervision by encouraging representations to be invariant to distortions. Existing methods heavily rely on hand-crafted augmentations, which are not easily adapted to new domains. To explore a general-purpose or domain-agnostic siamese network, we investigate using masking as augmentations in siamese networks. Recently, masking for siamese networks has only been shown useful with transformer architectures, e.g. MSN and data2vec. In this work, we identify the underlying problems of masking for siamese networks with arbitrary backbones, including ConvNets. We propose an effective and general-purpose masking strategy and demonstrate its effectiveness on various siamese network frameworks. Our method generally improves siamese networks’ performances in the few-shot image classification, and object detection tasks.
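
A generic patch-masking augmentation of the kind discussed can be sketched as follows (illustrative only; the paper's actual masking strategy differs in its details):

```python
import numpy as np

def random_patch_mask(img, patch=4, mask_ratio=0.3, fill=0.0, rng=None):
    """Mask a random subset of non-overlapping square patches in a CHW
    image array — masking used as an augmentation for a siamese branch."""
    if rng is None:
        rng = np.random.default_rng()
    c, h, w = img.shape
    gh, gw = h // patch, w // patch          # patch grid
    n = gh * gw
    idx = rng.choice(n, size=int(n * mask_ratio), replace=False)
    out = img.copy()
    for i in idx:
        r, col = divmod(int(i), gw)
        out[:, r*patch:(r+1)*patch, col*patch:(col+1)*patch] = fill
    return out
```

In a siamese setup, two differently masked views of the same image would be fed to the two branches, and the loss encourages their representations to agree.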

https://openreview.net/forum?id=NnHz2rU0Hjp

4. [LG] VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

S Pramanick, L Jing, S Nag, J Zhu, H J Shah, Y LeCun, R Chellappa
[Meta AI]

Key points:

  1. VoLTA is a unified VLP paradigm that uses only image-caption data with weakly-supervised patch-token alignment to achieve fine-grained region-level image understanding, eliminating the need for expensive box annotations;
  2. VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific transformer layers, reducing memory requirements;
  3. VoLTA is effective on a wide range of vision and vision-language downstream tasks, outperforming methods that use significantly more caption and box annotations.

One-sentence summary:
VoLTA is a new vision-language transformer paradigm that achieves fine-grained region-level image understanding without expensive box annotations.

Vision-language pre-training (VLP) has recently proven highly effective for various uni- and multi-modal downstream applications. However, most existing end-to-end VLP methods use high-resolution image-text-box data to perform well on fine-grained region-level tasks, such as object detection, segmentation, and referring expression comprehension. Unfortunately, such high-resolution images with accurate bounding box annotations are expensive to collect and use for supervision at scale. In this work, we propose VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm that only utilizes image-caption data but achieves fine-grained region-level image understanding, eliminating the use of expensive box annotations. VoLTA adopts graph optimal transport-based weakly-supervised alignment on local image patches and text tokens to germinate an explicit, self-normalized, and interpretable low-level matching criterion. In addition, VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific transformer layers, further reducing memory requirements. Extensive experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA on fine-grained applications without compromising the coarse-grained downstream performance, often outperforming methods using significantly more caption and box annotations.
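
The optimal-transport matching underlying the patch-token alignment can be sketched with plain Sinkhorn iterations (a bare-bones stand-in for VoLTA's graph-OT objective; names are illustrative):

```python
import numpy as np

def sinkhorn(cost, n_iters=100, eps=0.1):
    """Entropic optimal transport between uniform marginals. Given a
    patch-vs-token cost matrix, the returned plan is a soft, self-normalized
    matching of image patches to text tokens."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # uniform marginals
    K = np.exp(-cost / eps)                           # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                          # alternating scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]                # transport plan
```

Low-cost (high-similarity) patch-token pairs receive most of the plan's mass, giving the explicit, interpretable low-level matching criterion the abstract describes.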

https://openreview.net/forum?id=26aAV_wjoc

5. [CV] 3D-aware Conditional Image Synthesis

K Deng, G Yang, D Ramanan, J Zhu
[CMU]

Key points:

  1. pix2pix3D is a 3D-aware conditional generative model for controllable photorealistic image synthesis;
  2. Given a 2D label map, such as a segmentation or edge map, it learns to synthesize corresponding images from different viewpoints;
  3. Extending neural radiance fields, it assigns each 3D point a label in addition to color and density, letting it render the image and a pixel-aligned label map simultaneously;
  4. The learned 3D labels further enable interactive cross-view 3D editing.

One-sentence summary:
pix2pix3D is a 3D-aware conditional generative model that renders images from different viewpoints given a 2D label map, assigning each 3D point a label, color, and density to enable interactive cross-view 3D editing.

We propose pix2pix3D, a 3D-aware conditional generative model for controllable photorealistic image synthesis. Given a 2D label map, such as a segmentation or edge map, our model learns to synthesize a corresponding image from different viewpoints. To enable explicit 3D user control, we extend conditional generative models with neural radiance fields. Given widely-available monocular images and label map pairs, our model learns to assign a label to every 3D point in addition to color and density, which enables it to render the image and pixel-aligned label map simultaneously. Finally, we build an interactive system that allows users to edit the label map from any viewpoint and generate outputs accordingly.
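
The shared-density rendering that keeps the image and label map pixel-aligned can be sketched for a single ray (a minimal volume-rendering sketch, assuming per-point colors, labels, and densities are given):

```python
import numpy as np

def render_ray(densities, colors, labels, deltas):
    """Alpha-composite color and a per-point label distribution along one
    ray using the SAME densities — the mechanism that makes the rendered
    image and label map pixel-aligned.
    Shapes: densities/deltas (n,), colors (n, 3), labels (n, k)."""
    alphas = 1.0 - np.exp(-densities * deltas)
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    weights = alphas * trans                     # contribution of each sample
    rgb = (weights[:, None] * colors).sum(axis=0)
    label = (weights[:, None] * labels).sum(axis=0)
    return rgb, label, weights
```

Because both outputs reuse the same compositing weights, editing the label field at a 3D point consistently changes the label map rendered from every viewpoint.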

https://arxiv.org/abs/2302.08509


A few more papers worth noting:

[LG] A Survey of Geometric Optimization for Deep Learning: From Euclidean Space to Riemannian Manifold

Y Fei, X Wei, Y Liu, Z Li, M Chen
[East China Normal University]

Key points:

  1. Geometric optimization on Riemannian manifolds can address well-known problems in deep learning, such as vanishing or exploding gradients;
  2. The survey reviews recent progress in applying geometric optimization to deep learning, including the basic procedure, various geometric optimizers, and core concepts of Riemannian manifolds;
  3. It covers applications across different deep learning networks and AI tasks, and discusses typical open-source toolboxes for optimization on manifolds;
  4. Geometric optimization can exploit the geometry of the search space, speed up optimization, and alleviate exploding and vanishing gradients, but challenges remain for unexplored deep learning methods and manifold structures.

One-sentence summary:
Geometric optimization on Riemannian manifolds can alleviate challenges in deep learning, but open problems remain, especially for unexplored deep learning methods and manifold structures.

Although Deep Learning (DL) has achieved success in complex Artificial Intelligence (AI) tasks, it suffers from various notorious problems (e.g., feature redundancy, and vanishing or exploding gradients), since updating parameters in Euclidean space cannot fully exploit the geometric structure of the solution space. As a promising alternative solution, Riemannian-based DL uses geometric optimization to update parameters on Riemannian manifolds and can leverage the underlying geometric information. Accordingly, this article presents a comprehensive survey of applying geometric optimization in DL. At first, this article introduces the basic procedure of the geometric optimization, including various geometric optimizers and some concepts of Riemannian manifold. Subsequently, this article investigates the application of geometric optimization in different DL networks in various AI tasks, e.g., convolution neural network, recurrent neural network, transfer learning, and optimal transport. Additionally, typical public toolboxes that implement optimization on manifold are also discussed. Finally, this article makes a performance comparison between different deep geometric optimization methods under image recognition scenarios.
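
The basic procedure the survey describes — project the gradient to the tangent space, step, retract — can be sketched on the simplest manifold, the unit sphere (an illustrative sketch, not taken from the survey):

```python
import numpy as np

def sphere_step(x, egrad, lr=0.1):
    """One Riemannian gradient step on the unit sphere S^{n-1}:
    project the Euclidean gradient onto the tangent space at x,
    move against it, then retract back onto the manifold."""
    rgrad = egrad - (x @ egrad) * x      # tangent-space projection
    y = x - lr * rgrad                   # step in the tangent direction
    return y / np.linalg.norm(y)         # retraction (normalization)
```

For example, iterating this step on f(x) = xᵀAx drives x toward the eigenvector of A with the smallest eigenvalue while staying exactly on the sphere.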

https://arxiv.org/abs/2302.08210

[LG] A Nonstochastic Control Approach to Optimization

X Chen, E Hazan
[Princeton University]

Key points:

  1. Choosing hyperparameters for an optimization instance is a nonconvex problem;
  2. Meta-optimization can be formulated as an online nonstochastic control problem of learning the best optimization algorithm from a class of methods;
  3. Recent techniques from online nonstochastic control based on convex relaxations yield regret guarantees against the best offline solution;
  4. The proposed approach achieves convergence to optimality and regret minimization across a range of optimization settings.

One-sentence summary:
A nonstochastic control approach to meta-optimization in mathematical optimization that obtains regret guarantees against the best offline solution, achieving convergence to optimality in deterministic and stochastic optimization and regret minimization in online learning settings.

Selecting the best hyperparameters for a particular optimization instance, such as the learning rate and momentum, is an important but nonconvex problem. As a result, iterative optimization methods such as hypergradient descent lack global optimality guarantees in general.
We propose an online nonstochastic control methodology for mathematical optimization. First, we formalize the setting of meta-optimization, an online learning formulation of learning the best optimization algorithm from a class of methods. The meta-optimization problem over gradient-based methods can be framed as a feedback control problem over the choice of hyperparameters, including the learning rate, momentum, and the preconditioner. Although the original optimal control problem is nonconvex, we show how recent methods from online nonstochastic control using convex relaxations can be used to circumvent the nonconvexity, and obtain regret guarantees vs. the best offline solution. This guarantees that in meta-optimization, given a sequence of optimization problems, we can learn a method that attains convergence comparable to that of the best optimization method in hindsight from a class of methods.
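
The hypergradient-descent baseline mentioned in the abstract can be sketched in one dimension (a toy illustration of learning a hyperparameter online; the paper's control-based method is different and comes with regret guarantees):

```python
def hypergradient_descent(grad, x0, lr0=0.1, beta=1e-3, steps=200):
    """Gradient descent that also adapts its own learning rate online,
    using the product of successive gradients as a hypergradient signal.
    Scalar version, for illustration only."""
    x, lr, g_prev = x0, lr0, 0.0
    for _ in range(steps):
        g = grad(x)
        lr += beta * g * g_prev   # hypergradient step on the step size
        x -= lr * g               # ordinary descent step with learned lr
        g_prev = g
    return x, lr
```

When successive gradients point the same way, the step size grows; when they oppose each other (overshooting), it shrinks — a heuristic with no global guarantee, which is the gap the control formulation addresses.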

https://arxiv.org/abs/2301.07902

[CV] Denoising Diffusion Probabilistic Models for Robust Image Super-Resolution in the Wild

H Sahak, D Watson, C Saharia, D Fleet
[Google Research]

Key points:

  1. SR3+ is a diffusion-based blind super-resolution model that outperforms state-of-the-art GAN models on challenging, out-of-distribution input images;
  2. It uses self-supervised training combining composite, parameterized degradations with noise-conditioning augmentation;
  3. SR3+ is robust, generating realistic textures in a controllable way, performing excellently on natural images and reasonably on others;
  4. Performance improves markedly with larger models and larger training sets.

One-sentence summary:
SR3+ is a diffusion model that achieves state-of-the-art blind super-resolution by combining self-supervised training with parameterized degradations and noise-conditioning augmentation.

Diffusion models have shown promising results on single-image super-resolution and other image-to-image translation tasks. Despite this success, they have not outperformed state-of-the-art GAN models on the more challenging blind super-resolution task, where the input images are out of distribution, with unknown degradations. This paper introduces SR3+, a diffusion-based model for blind super-resolution, establishing a new state-of-the-art. To this end, we advocate self-supervised training with a combination of composite, parameterized degradations, and noise-conditioning augmentation during training and testing. With these innovations, a large-scale convolutional architecture, and large-scale datasets, SR3+ greatly outperforms SR3. It outperforms Real-ESRGAN when trained on the same data, with a DRealSR FID score of 36.82 vs. 37.22, which further improves to FID of 32.37 with larger models, and further still with larger training sets.
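
Noise-conditioning augmentation can be sketched as follows (a minimal sketch; the function name and noise-level range are illustrative assumptions, not the paper's exact recipe):

```python
import numpy as np

def noise_condition(lr_image, max_level=0.5, rng=None):
    """Corrupt the low-resolution conditioning image with Gaussian noise
    at a randomly drawn level t, and return t so the model can be
    conditioned on it — letting t be tuned at test time for robustness."""
    if rng is None:
        rng = np.random.default_rng()
    t = float(rng.uniform(0.0, max_level))
    noisy = lr_image + t * rng.standard_normal(lr_image.shape)
    return noisy, t
```

Conditioning on a noised input prevents the model from trusting degraded low-resolution details too literally, which is what makes it robust to unknown real-world degradations.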

https://arxiv.org/abs/2302.07864

[CV] MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation

O Bar-Tal, L Yariv, Y Lipman, T Dekel
[Weizmann Institute of Science]

Key points:

  1. MultiDiffusion is a framework for controllable image generation with a pre-trained diffusion model;
  2. It requires no further training or fine-tuning and applies to a variety of generation tasks;
  3. The generation process is cast as an efficiently computable optimization task guaranteed to converge to its global optimum;
  4. MultiDiffusion produces state-of-the-art results even compared with methods trained specifically for each task.

One-sentence summary:
MultiDiffusion is a framework for versatile, controllable image generation that needs no further training or fine-tuning, producing state-of-the-art results even on tasks requiring user-provided controls.

Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image, and fast adaptation to new tasks still remains an open challenge, currently mostly addressed by costly and long re-training and fine-tuning or ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes. Project webpage: this https URL
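
The step that binds multiple diffusion windows can be sketched as a per-pixel reconciliation of overlapping denoised proposals (a simplification of MultiDiffusion's least-squares objective, whose closed-form solution is the average; names are illustrative):

```python
import numpy as np

def fuse_windows(canvas_shape, origins, proposals):
    """One MultiDiffusion-style fusion step: each window proposes a
    denoised crop at its (y, x) origin; the per-pixel least-squares
    reconciliation of all overlapping proposals is their average."""
    acc = np.zeros(canvas_shape)
    cnt = np.zeros(canvas_shape)
    for (y, x), crop in zip(origins, proposals):
        h, w = crop.shape
        acc[y:y + h, x:x + w] += crop
        cnt[y:y + h, x:x + w] += 1
    return acc / np.maximum(cnt, 1)   # average where covered, 0 elsewhere
```

Repeating this fusion at every denoising step keeps the separate diffusion paths consistent on their overlaps, which is how a fixed-size pre-trained model can produce, say, a seamless panorama.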

https://arxiv.org/abs/2302.08113
